Script description:

The aim of this script is to perform a statistical analysis of untargeted metabolomics data which has been previously pre-treated with the script “Prepare_Datasets_to_Statistical_Analysis.Rmd”. This performs several tasks to achieve this end:

  1. Read the raw data files from the input folder.

  2. Normalize data set with the package “NormalyzerDE” without the need of a design input (it is created with the information given in the input files).

  3. After the review and evaluation of the normalization report by the researcher, the selected data normalized is uploaded by changing line 178 with the corresponding file. You can choose among the following options: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt. If you do not want to use any normalized file, then set it as “None”.

  4. Data visualization before and after the normalization with box plot from “NormalyMets” package.

  5. Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) from packages “FactoMineR” and “factoextra” for normalized dataset to check the right group differentiation and a possible batch effect.

  6. Statistical analysis, using the “Limma” package, performed to extract the p-values and log2 fold change values (LFC) for every metabolite in the conducted comparative analysis. Additionally, p-value adjustment methods, such as Benjamini-Hochberg, Bonferroni, and q-values were applied. Tables with the results are given, as well as plots displaying the original p-values and q-values, which are available for visualization.

  7. The identification of significant metabolites will be based on the statistical thresholds set at point 4 of the usage instructions. A message will be displayed indicating the number of decreased and increased metabolites for each adjustment method, taking into account the linear model used in the Limma analysis. This will provide insights into the directionality of differential expression for the identified metabolites. [As the test model was set up as_“contrast <- paste0(group[2],”-“,group[1])”. _This instruction represent the measures difference in expression levels between Group 2 and Group 1. A positive LFC indicates higher expression in Group 2 compared to Group 1, while a negative LFC indicates higher expression in Group 1 compared to Group 2. The magnitude of the LFC represents the change in expression between the two groups.] The number of significant differences will be represented by venn-diagrams (“ggVennDiagram” package) and upset plot (“UpSetR” package).

  8. After having selected a p-value adjustment method in line 750, the significant metabolites based on the statistical thresholds will be plotted by Mean-Average plots and Volcano plots, also a dynamic volcano plot is displayed where you can see the information for each dot (metabolite). Tables with the 10 most significant metabolites for each comparative will be displayed.

  9. Finally, a sparse Partial Least Squares Discriminant Analysis (sPLS-DA) will be conducted using the “mixOmics” package. This analysis uses 2 components and 25 variables. The results for the sPLS-DA analysis include a loadings plot, individual samples plot, variable plot, and a prediction essay. Additionally, predictions results and the top 10 most important variable predictors for each comparative will be presented in tables, providing valuable insights into the predictive power of the model and the variables driving the differences between the comparatives.

To run this script, you need to provide the input and output folder addresses, as well as the statistical thresholds.


USAGE INSTRUCTIONS:

  1. The environment must be clear. If not, run this “chunk”:
rm(list = ls())
  1. Input data must proceeded from the script “Prepare_Datasets_to_Statistical_Analysis.Rmd”, because they must have the words “one_factor” in their name.

  2. You will need the following packages

packages <- c("tidyverse", "plotly", "mixOmics", "devtools", "NormalizeMets", "tools", "kableExtra",
              "qvalue", "NormalyzerDE", "limma", "FactoMineR", "factoextra", "ggVennDiagram", "UpSetR", "ggpubr",
              "ggrepel")

for (package in packages) {
  if (!requireNamespace(package, quietly = TRUE)) {
    if (package == "NormalizeMets") {
      devtools::install_github("metabolomicstats/NormalizeMets")
    } else {
      install.packages(package)
    }
  }
  library(package, character.only = TRUE)
}
  1. Set the input, output folders path here, as well as the statistical limits that will be used in the analysis
# Please indicate if your data NA values have been impute in the previous script or not. Yes/No
imputed <- "No"

if (imputed == "Yes") {
  input_folder <- "../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use"
  dir.create(paste0(input_folder, "/Normalyze_data"))
  output_folder <- "../1_Data/Untargeted_metabolomic_data/Imputed_ready_to_use/Normalyze_data"
  dir.create("../Images/Imputed_separate_comparatives")
  image_folder <- "../Images/Imputed_separate_comparatives"
} else {
  input_folder <- "../1_Data/Untargeted_metabolomic_data/Common_ready_to_use"
  dir.create(paste0(input_folder, "/Normalyze_data"))
  output_folder <- "../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data"
  dir.create("../Images/Common_metabolites")
  image_folder <- "../Images/Common_metabolites"
}

# Set Normalize method on line 183 #
# Set adjusted p-value on line 750 #

# Set the statistical parameters for the significant analysis
stats_values <- list()
LFC_limit <- "LFC_limit"; stats_values[[LFC_limit]] <- 2
alpha_value <- "alpha_value"; stats_values[[alpha_value]] <- 0.01

files <- list.files(input_folder)

files <- files[grepl("one_factor", files, ignore.case = T)]

——————————————————————————–


Previous steps to perform with my datasets

PERSONAL USE

To ensure the correc organization of the comparative A columns and accurate assignment of the comparative E groups, it is necessary to execute this specific code section for my data.

for (i in files){
  # Read data files
  data_path <- file.path(input_folder, i)
  data <- read.csv(data_path)
  group <- data [1, ] %>% .[-1] %>% as.character(.) %>% unique(.)
  
  # Perform the changes if necessary
  if (group[1] == "cA_S17" & group[2] == "cA_S25") {
    data <- data[, c(1, 5:7, 2:4)]
    # Overwrite the original file
    write.csv(data, file = data_path, row.names = FALSE)
  } else if (group[1] == "cA_S17" & group[2] == "cB_OTA_S17") {
    group <- c("group", rep("cE_noOTA", 3), rep("cE_OTA", 3))
    data <- data [-1, ] %>% rbind(group, .)
    # Overwrite the original file
    write.csv(data, file = data_path, row.names = FALSE)
  } 
}

files <- list.files(input_folder)
files <- files[grepl("one_factor", files, ignore.case = T)]

Normalization

This code section iterates over a list of files, reads each file’s data, prepares the dataset by assigning row names and extracting group information, writes the design and raw data to separate files, applies the normalyzer function for normalization, saves the normalized result, and finally removes temporary files.

for (i in files){
    # Read data files
    data_path <- file.path(input_folder, i)
    raw_data <- read.csv(data_path)
    
    # Prepare dataset
    rownames(raw_data) <- raw_data$name
    group <- raw_data[1, -1] %>% as.character()
    raw_data <- raw_data[-1, -1]

    # Design and temporal objects
    design <- data.frame(sample=colnames(raw_data), group=group)

    write.table(x = design, file = paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"),
                quote = F, row.names = F, sep = "\t")
    write.table(x = raw_data, file = paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"),
                quote = F, row.names = F, sep = "\t")

    # NormalyzerDE aplication
    suppressMessages(normalyzer(jobName = paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
                                designPath = paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"),
                                dataPath = paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"),
                                outputDir = output_folder))
    cat(sprintf("File %s has been Normalyzed and the result has been saved in %s.\n", i,
                paste0(output_folder, "/", sub("_.*.?", "", i), "_NormalyzerDE\n")))

    
    unlink(paste0(output_folder, sub("_.*.?", "", i), "_design.tsv"), recursive = T)
    unlink(paste0(output_folder, sub("_.*.?", "", i), "_tmp_data.tsv"), recursive = T)
}

For the example data that was used to optimize this script and develop the corresponding Final Master Project, you can refer to the normalization summaries in the following files.

Normalized summary of Comparative A

Normalized summary of Comparative B

Normalized summary of Comparative C

Normalized summary of Comparative D

Normalized summary of Comparative E

# You can choose between the followings data normalized files: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt. 
# If you do not want to use any normalized file set "None" 

norm_selected <- "/Quantile-normalized.txt"

if (norm_selected != "None") {
  
  for (i in files){
    # Set the new name for the normalized data that will be analysed
    name <- paste0(sub("_.*.?", "", i), "_normalized") 
    data <- read.table(paste0(output_folder, "/", paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
                              norm_selected), header = T)
    
    # Rename rows according to their original names in the not normalized data. It is posible because the order is mainteined during the normalization process but 
    data2 <- read.csv(paste0(input_folder, "/", i))
    rownames(data2) <- data2$name
    group <- data2[1, -1] %>% rownames_to_column(); data2 <- data2[-1, -1]
    row.names(data) <- row.names(data2)
    
    # Reestructure the dataset by rearranging the columns
    data <- rownames_to_column(data, var = "name")
    data <- rbind(as.character(group), data)
    data[is.na(data)] <- 0
    
    assign(name, data)            
    write.csv(data, file = paste0(output_folder, "/", paste0(sub("_.*.?", "", i),
                                                             "_normalized.csv")), row.names = FALSE)
    
    cat(sprintf("File %s normalized by %s method has been saved in %s.\n", i,
                sub(".*/(.*?)\\-.*", "\\1", norm_selected),
                paste0(output_folder, "/", paste0(sub("_.*.?", "", i), "_normalized.csv \n"))))
  }
  
  files2 <- ls()
  files2 <- files2[grepl("normalized", files2, ignore.case = T)]
  
  for (i in files){
    # Set the new name for the normalized data that will be analysed
    name <- paste0(sub("_.*.?", "", i), "_no_normalized")
    data_path <- file.path(input_folder, i)
    data <- read.csv(data_path)
    assign(name, data) 
    
    files <- ls()
    files <- files[grepl("no_normalized", files, ignore.case = T)]
  }
  
} else {
  
  for (i in files){
    # Set the new name for the normalized data that will be analysed
    name <- paste0(sub("_.*.?", "", i), "_no_normalized") 
    data_path <- file.path(input_folder, i)
    data <- read.csv(data_path)
    assign(name, data) 
    
    files <- ls()
    files <- files[grepl("no_normalized", files, ignore.case = T)]
  }

}
## File compA_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compA_normalized.csv 
## .
## File compB_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compB_normalized.csv 
## .
## File compC_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compC_normalized.csv 
## .
## File compD_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compD_normalized.csv 
## .
## File compE_merged_One_Factor.csv normalized by Quantile method has been saved in ../1_Data/Untargeted_metabolomic_data/Common_ready_to_use/Normalyze_data/compE_normalized.csv 
## .

Data visualization

Boxplot

A box plot is a visual summary of a dataset’s distribution. It helps to identify the center, spread, and skewness of the data. To assess data normality using a box plot, look for symmetry, absence of outliers, balanced box length, symmetric whisker length, and a median close to the box center.

This boxplot has been made thanks to “normalizeMets” package, which use ggplot2 graphs. In raw data case, it was log transformed to set similar axis as the ones of normalized data.

plots1 <- list()
plots2 <- list()

if (norm_selected != "None") {
  for (i in files){
      data <- get(i)
      
      # Prepare dataset without normalization
      group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
      met_names <- data$name %>% .[-1]
      data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
      data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
  
      # Original data visualization (log transformed and numeric format)
      plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F, 
                       plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
                       savetype = "png", interactiveplot = T)
      index <- length(plots1) + 1
      plots1[[index]] <- plot
  }
  
  for (i in files2){
    data <- get(i)
    
    # Prepare dataset normalized
    group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
    met_names <- data$name %>% .[-1]
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1]
    data <- data %>% t() %>% `colnames<-`(met_names)

    # Normalized data visualization (log transformed and numeric format)
    plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F, 
                       plotname = paste0("BoxPlot of ", sub("_.*.?", "", i),
                                         " normalized by ", sub(".*/(.*?)\\-.*", "\\1", norm_selected), " method"),
                       savetype = "png", interactiveplot = T)
    index <- length(plots2) + 1
    plots2[[index]] <- plot
  }
  
} else {
  
  for (i in files){
      data <- get(i)
      
      # Prepare dataset without normalization
      group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
      met_names <- data$name %>% .[-1]
      data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
      data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
  
      # Original data visualization (log transformed and numeric format)
      plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F, 
                       plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
                       savetype = "png", interactiveplot = T)
      index <- length(plots2) + 1
      plots2[[index]] <- plot
  }
}

Exploratory data analysis

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical technique that seeks to reduce the dimensionality of a dataset by maximizing the variance. It achieves this by transforming the original variables into a smaller set of uncorrelated variables called principal components. The goal is to retain as much relevant information as possible while minimizing the loss of information. PCA helps simplifying complex datasets, visualize data, and identify the most important patterns or features contributing to the overall variance.

if (norm_selected == "None") {
  for (i in files) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,])
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # PCA performance
    tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
    ## Boxplot
    tmp_eig <- tmp_pca$eig %>% .[1:5, ]
    barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
            main = paste("PCA of", sub("_.*.?", "", i), "no normalized"),
            xlab = "Principal Components",
            ylab = "Percentage of variances",
            col ="gold")
    ## Graph of individuals
    plot <- fviz_pca_ind(tmp_pca, col.ind = group, 
               pointsize=1, pointshape=21,fill="black",
               repel = T, 
               addEllipses = T, ellipse.type = "confidence",
               legend.title ="Conditions",
               show_legend = TRUE, show_guide = TRUE)
    labels <- rownames(data)
    plot <- plot + 
    ggtitle(paste("PC1 & PC2 of", sub("_.*.?", "", i), "no normalized")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/PCA_plot_of_", sub("_.*.?", "", i), "_no_normalized.png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
} else {
  for (i in files2) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,]); names <- colnames(data)
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # PCA performance
    tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
    ## Boxplot
    tmp_eig <- tmp_pca$eig %>% .[1:5, ]
    barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
            main = paste("PCA of", sub("_.*.?", "", i),
                                        "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"),
            xlab = "Principal Components",
            ylab = "Percentage of variances",
            col ="gold")
    ## Graph of individuals
    plot <- fviz_pca_ind(tmp_pca, col.ind = group, 
               pointsize=1, pointshape=21,fill="black",
               repel = T, 
               addEllipses = T, ellipse.type = "confidence",
               legend.title ="Conditions",
               show_legend = TRUE, show_guide = TRUE)
    labels <- rownames(data)
    plot <- plot +
      ggtitle(paste("PC1 & PC2 of", sub("_.*.?", "", i),
                                        "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
      theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/PCA_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
}

Hierarchical Clustering on Principal Components (HCPC) analysis

HCPC (Hierarchical Clustering on Principal Components) is a statistical method that combines hierarchical clustering and principal component analysis (PCA). It is used for exploratory analysis and clustering of multivariate data.

HCPC first performs PCA on the dataset to obtain principal components that capture the variability in the data. Then, it applies hierarchical clustering on these principal components to group similar observations together based on their distance or similarity measures.

HCPC is particularly useful when dealing with high-dimensional datasets, as it allows for a comprehensive exploration of the data by revealing the underlying clusters or patterns. It can assist in identifying subgroups within a dataset and gaining a better understanding of the relationships between variables.

if (norm_selected == "None") {
  for (i in files) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,])
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # HCA performance
    res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
    res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)  
    plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet", 
                      label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
                      type="rectangle", labels_track_height = 1000, horiz = T,repel = T) + 
      ggtitle(paste("Cluster Dendrogram of", sub("_.*.?", "", i), "no normalized")) +
      theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/HCPC_plot_of", sub("_.*.?", "", i), "_no_normalized.png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
} else {
  for (i in files2) {
    # Dataset set-up
    data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
    group <- as.character(data[1 ,]); names <- colnames(data)
    data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
    # HCA performance
    res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
    res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)  
    plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet", 
                      label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
                      type="rectangle", labels_track_height = 1000, horiz = T, repel = T) + 
      ggtitle(paste("Cluster Dendrogram of", sub("_.*.?", "", i),
                                        "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
      theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
    ggsave(filename = paste0(image_folder, "/HCPC_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
    print(plot)
  }
}

Clear environment

——————————————————————————–


Stats

p-value

The p-value is a measure used in statistical hypothesis testing to determine the significance of a comparison. When the p-value is smaller than a significance level “α”, it indicates that there is sufficient evidence to reject the null hypothesis (H0). Most usually “α” value use to be 0.05 (5%) or 0.01 (1%).

In this research, where “α” = 0.05, the null hypothesis suggests that there are no significant differences between the amounts of the metabolite in each condition. By rejecting the H0 and accepting the alternative hypothesis (H1), we conclude that there are statistically significant differences in the amounts of the metabolite between the conditions being compared.

* H0 —> there is no difference between the sample groups for the amounts of the metabolites. * H1 —> There is a difference between the sample groups for the amounts of the metabolites.

  • If p-value < 0.05 we refuse the H0 and accept the H1. There are statistically significant differences.
  • If p-value > 0.05 we no refuse the H0. There are no statistically significant differences.

LFC

This indicates a significant difference in the metabolite’s amount between two different conditions. By fixing an LFC value of 1 or -1, we will consider as statistically significant only those metabolites that have a difference in amount of at least two-fold or greater in either direction.

If we fix an LFC value of 2 or -2, we will be more restrictive and consider only those metabolites that have a difference in amount of at least four-fold or greater in either direction. This approach will capture larger and more pronounced changes in metabolite levels.

files2 <- vector()

for (i in files){
  # Dataset set-up
  data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
  met_names <- row.names(data) %>% .[-1]
  group_rep <- as.character(data[1 ,]) %>% table(.)
  group <- as.character(data[1 ,]) %>% unique(.)  
  data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% `row.names<-` (., met_names)
  
  # Prepare limma assay
  experimental.design <- model.matrix(~ -1+factor(c(rep(1, group_rep[[1]]),rep(2, group_rep[[2]])))) %>%
    `colnames<-`(., group)
  
  # Limma assay (condition 2 vs. condition 1)
  linear.fit <- lmFit(data, experimental.design)
  contrast <- paste0(group[2],"-",group[1])
  contrast.matrix <- makeContrasts(contrast = contrast, levels = group)
  contrast.linear.fit <- contrasts.fit(linear.fit, contrast.matrix)
  contrast.results <- eBayes(contrast.linear.fit)
  
  # New dataframe with stats
  data_stats <- topTable(contrast.results, number = nrow(data), coef = 1)
  new_name <- paste0(sub("_.*.?", "", i), "_stats")
  assign(new_name, data_stats)
  
  files2 <- append(files2, new_name)
  
  output <- paste(rownames(data_stats[1, ]), "has the following statistics",
                  paste(colnames(data_stats), data_stats[1, ], sep = ": ", collapse = ", "))
  cat(sprintf("The file %s has been processed using the Limma package, comparing condition %s against condition %s, and the statistical information obtained has been saved in the file %s. Here is an example of the information it contains: metabolite %s\n", i, group[2], group[1], new_name, output))
  cat("\n")
}
## The file compA_normalized has been processed using the Limma package, comparing condition cA_S17 against condition cA_S25, and the statistical information obtained has been saved in the file compA_stats. Here is an example of the information it contains: metabolite Pseudouridine has the following statistics logFC: -2.8776344198033, AveExpr: 19.2391476963404, t: -28.6142616251652, P.Value: 3.53124446142366e-07, adj.P.Val: 8.22779959511713e-05, B: 7.73818466181533
## 
## The file compB_normalized has been processed using the Limma package, comparing condition cB_OTA_S17 against condition cB_noOTA_S25, and the statistical information obtained has been saved in the file compB_stats. Here is an example of the information it contains: metabolite thiodiglycol has the following statistics logFC: -4.02354883533047, AveExpr: 20.6764256590878, t: -33.0512253390896, P.Value: 6.35913284170357e-08, adj.P.Val: 1.0738125506286e-05, B: 9.23023126088469
## 
## The file compC_normalized has been processed using the Limma package, comparing condition cC_OTA against condition cC_noOTA, and the statistical information obtained has been saved in the file compC_stats. Here is an example of the information it contains: metabolite (S,S)-Anacine has the following statistics logFC: 11.7537798899378, AveExpr: 20.7159678775536, t: 42.4798956282828, P.Value: 7.70195141396493e-15, adj.P.Val: 4.08973620081538e-12, B: 21.8781524022808
## 
## The file compD_normalized has been processed using the Limma package, comparing condition cD_OTA against condition cD_noOTA, and the statistical information obtained has been saved in the file compD_stats. Here is an example of the information it contains: metabolite SPF-32629A has the following statistics logFC: -6.7298492072946, AveExpr: 14.9705154269062, t: -48.5771686424746, P.Value: 1.36206814674928e-20, adj.P.Val: 9.65706316045242e-18, B: 34.2453574248732
## 
## The file compE_normalized has been processed using the Limma package, comparing condition cE_OTA against condition cE_noOTA, and the statistical information obtained has been saved in the file compE_stats. Here is an example of the information it contains: metabolite 1-(4-Hydroxyphenyl)-2-aminoethanol has the following statistics logFC: -6.4448470289833, AveExpr: 22.9173710368963, t: -42.5519897138641, P.Value: 2.44718315116484e-08, adj.P.Val: 2.61958247328103e-06, B: 10.2286119933696
cat(sprintf("Vector files2 has the following information %s.\n", paste(files2, collapse = ", ")))
## Vector files2 has the following information compA_stats, compB_stats, compC_stats, compD_stats, compE_stats.

p-value adjustment

Multiple comparisons refer to the situation where several pairwise comparisons are made within a dataset or experiment. When conducting multiple comparisons, the probability of obtaining false positive results (Type I errors) increases. To address this issue, various statistical methods can be used to adjust the significance level (p-value) or control the overall error rate.

Methods to control the Family Wise Error (FWER):

It is a statistical concept that refers to the probability of making at least one Type I error (rejecting a true null hypothesis) among a set of multiple hypothesis tests. The FWER is a measure that quantifies the overall error rate when multiple hypotheses are tested simultaneously.

p-value adjustment by Bonferroni:

It involves dividing the significance level (p-value) by the number of comparisons performed. For example, if there are 10 comparisons being made and a desired overall significance level of 5%, the p-value threshold would be adjusted to 0.005 (0.05 divided by 10). If the p-value obtained in a comparison is lower than this adjusted threshold, it is considered statistically significant. This method adjusts the significance level for each individual hypothesis to maintain the desired FWER.

Methods to control the False Discovery Rate (FDR):

The “BH” and “BY” methods of Benjamini, Hochberg, and Yekutieli control the false discovery rate (FDR), the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.

p-value adjustment by Benjamini-Hochberg (BH):

The Benjamini-Hochberg method is another technique used to control the type I error in multiple comparisons. Unlike Bonferroni, B-H is less conservative and can be more powerful in detecting statistically significant differences. Instead of adjusting each individual p-value, B-H controls the expected proportion of false discoveries among the rejected hypotheses. This is achieved by ordering the p-values from lowest to highest, calculating a critical value based on the desired FDR rate, and comparing it to each p-value in order. P-values below the critical value are considered statistically significant.

q- value:

Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value.

P-VALUE ADJUSTMENT

for (i in files2) {
  message(paste("Processing file:", i))
  data <- get(i)
  data$p_value_bonferroni <- p.adjust(data$P.Value, method = "bonferroni", n = length(data$P.Value))
  names(data)[names(data) == "adj.P.Val"] <- "p_value_BH"
  new_name <- paste0("q_value_", sub("_.*.?", "", i))
  tmp <- data$P.Value
  if (i == "compC_stats"){
    message(paste("Error in qvalue calculation for file:", i))
    tmp_q_value <- qvalue(tmp, pi0 = 1)
    message(paste("File", i, "processed successfully with pi0 = 1"))
    } else {
      tmp_q_value <- qvalue(tmp)
      message(paste("File", i, "processed successfully"))
    }
  data$q_value <- tmp_q_value$qvalues
  assign(new_name, tmp_q_value)
  assign(i, data)
}
## Processing file: compA_stats
## File compA_stats processed successfully
## Processing file: compB_stats
## File compB_stats processed successfully
## Processing file: compC_stats
## Error in qvalue calculation for file: compC_stats
## File compC_stats processed successfully with pi0 = 1
## Processing file: compD_stats
## File compD_stats processed successfully
## Processing file: compE_stats
## File compE_stats processed successfully
Comparative A statistics of cA_S17 Vs. cA_S25
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
Pseudouridine -2.877634 19.23915 -28.61426 3.5e-07 8.2e-05 7.738185 8.2e-05 4.7e-05
Uracil;2_4-Dihydroxypyrimidine -3.473787 23.17253 -22.33493 1.4e-06 1.6e-04 6.395329 3.2e-04 9.1e-05
1-(4-Hydroxyphenyl)-2-aminoethanol 2.165694 25.21030 18.68856 3.6e-06 2.8e-04 5.378592 8.3e-04 1.6e-04
Asenjonamide C 2.759073 23.48339 15.76866 8.9e-06 4.0e-04 4.383495 2.1e-03 2.3e-04
[FAhydroxy(18:0)]12_13-dihydroxy-9Z-octadecenoicacid 2.009285 22.92325 15.74301 9.0e-06 4.0e-04 4.373868 2.1e-03 2.3e-04
N-Acetyl-beta-D-galactosamine -1.107902 21.33208 -14.88642 1.2e-05 4.0e-04 4.042318 2.8e-03 2.3e-04
Comparative B statistics of cB_OTA_S17 Vs. cB_noOTA_S25
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
thiodiglycol -4.023549 20.67643 -33.05123 6.4e-08 1.1e-05 9.230231 2.1e-05 3.5e-06
2-Oxovalericacid 5.599484 21.30886 32.92127 6.5e-08 1.1e-05 9.210657 2.1e-05 3.5e-06
Malyngolide 2.714021 20.20551 22.59119 5.9e-07 6.5e-05 7.173198 1.9e-04 2.1e-05
Hexadeca-5,9-dienoic acid 2.160233 19.49440 19.31046 1.5e-06 1.2e-04 6.248042 4.9e-04 3.9e-05
Sclerotigenin 2.395235 20.99894 17.74552 2.4e-06 1.4e-04 5.736949 8.0e-04 4.4e-05
Sulcatone 2.701423 21.72628 17.66307 2.5e-06 1.4e-04 5.708569 8.2e-04 4.4e-05
Comparative C statistics of cC_OTA Vs. cC_noOTA
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
(S,S)-Anacine 11.753780 20.71597 42.47990 7.7e-15 4.1e-12 21.87815 4.1e-12 4.1e-12
Tanikolide 7.025822 17.89991 37.16378 4.0e-14 1.1e-11 20.91319 2.1e-11 1.1e-11
1,4-Anhydro-5-O-tetradecanoyl-D-glucitol 8.088456 16.14435 35.70987 6.6e-14 1.2e-11 20.60305 3.5e-11 1.2e-11
ProstaglandinF1a 7.574203 16.54710 33.44378 1.5e-13 2.0e-11 20.07219 7.8e-11 2.0e-11
4,10-dihydroxy-10-methyl-dodecan-4-olide 6.853155 17.48641 32.11155 2.4e-13 2.5e-11 19.73007 1.3e-10 2.5e-11
Takinolide seco-acid 5.883005 17.23582 31.71270 2.8e-13 2.5e-11 19.62293 1.5e-10 2.5e-11
Comparative D statistics of cD_OTA Vs. cD_noOTA
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
SPF-32629A -6.729849 14.97052 -48.57717 1.4e-20 9.7e-18 34.24536 9.7e-18 5.0e-18
2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine -4.048766 20.74461 -23.79326 4.4e-15 1.6e-12 24.43944 3.1e-12 8.0e-13
6-formylsalicylic acid -6.304306 15.21251 -21.50028 2.6e-14 6.1e-12 22.82836 1.8e-11 3.1e-12
C17H37P3 -5.007224 19.52357 -17.72204 7.3e-13 1.3e-10 19.68782 5.2e-10 6.6e-11
C31H42O3P2 -4.530932 17.59934 -16.75030 1.9e-12 2.7e-10 18.76162 1.3e-09 1.4e-10
C6H16N6O4P2 2.178261 20.59727 16.19640 3.4e-12 4.0e-10 18.20842 2.4e-09 2.0e-10
Comparative E statistics of cE_OTA Vs. cE_noOTA
logFC AveExpr t P.Value p_value_BH B p_value_bonferroni q_value
1-(4-Hydroxyphenyl)-2-aminoethanol -6.444847 22.91737 -42.55199 2.4e-08 2.6e-06 10.228612 4.6e-06 8.8e-07
Asenjonamide C -7.371831 20.86065 -41.66516 2.8e-08 2.6e-06 10.126882 5.2e-06 8.8e-07
C4H9N5O6 2.730207 19.94422 30.90247 1.5e-07 7.8e-06 8.571225 2.8e-05 2.6e-06
(R)-4-Dehydropantoate 3.828108 22.02130 30.38516 1.6e-07 7.8e-06 8.477877 3.1e-05 2.6e-06
C30H52NP3 4.262037 19.66849 26.42986 3.6e-07 1.4e-05 7.688768 6.9e-05 4.6e-06
1(2H)-Isoquinolinone -3.563701 21.78231 -24.55777 5.5e-07 1.7e-05 7.261687 1.0e-04 5.8e-06

P-VALUE REPRESENTATION

for (i in files2){
  # Q-value plots:
  q_value_plot <- paste0("q_value_", sub("_.*.?", "", i))
  q_value_plot <- get(q_value_plot)
  plot(q_value_plot, main = paste("Q-value plots of file", i)) 
  
  # P-values histograms:
  data <- get(i)
  hist(data$P.Value, main = paste("Original p-value of file", i), 
       xlab = "p-value", col = "gold")
}

Clear environment

——————————————————————————–


Graphs and results of significant metabolites

Significant metabolites

## SIGNIFICANT METABOLITES in file compA_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 233. 
## 
## The significant metabolites in file compA_stats by original p-values obteined by limma are 16. 
##   - 7 are increased. 
##   - 9 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
## 
## The significant metabolites in file compA_stats by Bonferroni p-values are 10. 
##   - 5 are increased. 
##   - 5 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
## 
## The significant metabolites in file compA_stats by Benjamini-Hochberg p-values are 16. 
##   - 7 are increased. 
##   - 9 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
## 
## The significant metabolites in file compA_stats by q-value testare 16. 
##   - 7 are increased. 
##   - 9 are decreased. 
##   - The most significant metabolite increased in cA_S17 condition is 1-(4-Hydroxyphenyl)-2-aminoethanol. 
##   - Whereas the most significant metabolite decreased in cA_S17 condition is Pseudouridine.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compB_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 330. 
## 
## The significant metabolites in file compB_stats by original p-values obteined by limma are 19. 
##   - 9 are increased. 
##   - 10 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
## 
## The significant metabolites in file compB_stats by Bonferroni p-values are 12. 
##   - 8 are increased. 
##   - 4 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
## 
## The significant metabolites in file compB_stats by Benjamini-Hochberg p-values are 18. 
##   - 9 are increased. 
##   - 9 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
## 
## The significant metabolites in file compB_stats by q-value testare 19. 
##   - 9 are increased. 
##   - 10 are decreased. 
##   - The most significant metabolite increased in cB_OTA_S17 condition is 2-Oxovalericacid. 
##   - Whereas the most significant metabolite decreased in cB_OTA_S17 condition is thiodiglycol.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compC_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 531. 
## 
## The significant metabolites in file compC_stats by original p-values obteined by limma are 160. 
##   - 80 are increased. 
##   - 80 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
## 
## The significant metabolites in file compC_stats by Bonferroni p-values are 99. 
##   - 60 are increased. 
##   - 39 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
## 
## The significant metabolites in file compC_stats by Benjamini-Hochberg p-values are 154. 
##   - 78 are increased. 
##   - 76 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
## 
## The significant metabolites in file compC_stats by q-value testare 154. 
##   - 78 are increased. 
##   - 76 are decreased. 
##   - The most significant metabolite increased in cC_OTA condition is (S,S)-Anacine. 
##   - Whereas the most significant metabolite decreased in cC_OTA condition is C17H35P3.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compD_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 709. 
## 
## The significant metabolites in file compD_stats by original p-values obteined by limma are 64. 
##   - 35 are increased. 
##   - 29 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
## 
## The significant metabolites in file compD_stats by Bonferroni p-values are 33. 
##   - 14 are increased. 
##   - 19 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
## 
## The significant metabolites in file compD_stats by Benjamini-Hochberg p-values are 55. 
##   - 28 are increased. 
##   - 27 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
## 
## The significant metabolites in file compD_stats by q-value testare 60. 
##   - 32 are increased. 
##   - 28 are decreased. 
##   - The most significant metabolite increased in cD_OTA condition is C6H16N6O4P2. 
##   - Whereas the most significant metabolite decreased in cD_OTA condition is SPF-32629A.
## 
## |------------------------|
## 
## SIGNIFICANT METABOLITES in file compE_stats with a LFC value of 2 and a α value of 0.01. 
## 
## Total number of metabolites is 190. 
## 
## The significant metabolites in file compE_stats by original p-values obteined by limma are 29. 
##   - 14 are increased. 
##   - 15 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is C4H9N5O6. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## 
## The significant metabolites in file compE_stats by Bonferroni p-values are 21. 
##   - 9 are increased. 
##   - 12 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is C4H9N5O6. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## 
## The significant metabolites in file compE_stats by Benjamini-Hochberg p-values are 29. 
##   - 14 are increased. 
##   - 15 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is C4H9N5O6. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## 
## The significant metabolites in file compE_stats by q-value testare 29. 
##   - 14 are increased. 
##   - 15 are decreased. 
##   - The most significant metabolite increased in cE_OTA condition is C4H9N5O6. 
##   - Whereas the most significant metabolite decreased in cE_OTA condition is 1-(4-Hydroxyphenyl)-2-aminoethanol.
## 
## |------------------------|

Representation of significant metabolites according to the p-value used

for (i in files2){
  data <- get(i)
  # Obtain the significant metabolites by original p-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & P.Value < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & P.Value < stats_values$alpha_value))
  Original_mets <- row.names(significant)
  # Obtain the significant metabolites by BH p-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_BH < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value))
  BH_mets <- row.names(significant)
  # Obtain the significant metabolites by original p-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value))
  Bonferroni_mets <- row.names(significant)
  # Obtain the significant metabolites by q-values
  significant <- subset(data, (logFC > stats_values$LFC_limit & q_value < stats_values$alpha_value) |
                          (logFC < -stats_values$LFC_limit & q_value < stats_values$alpha_value))
  Qvalue_mets <- row.names(significant)
  
  significant <- list(Original_mets = Original_mets, BH_mets = BH_mets, Bonferroni_mets = Bonferroni_mets,
                      Qvalue_mets = Qvalue_mets)

  # Performance Venn diagram
  random_color <- sample(colors(), 1)
  plot <- ggVennDiagram(significant, label_alpha = 0.4) +
    scale_fill_gradient(low="white", high = random_color) + labs(fill = "Significant \nmetabolites") +
    ggtitle(paste("Venn Diagram of file", i,"with the significant \nmetabolites according to the p-value used")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
  print(plot)
  
  # Upset diagram
  plot <- upset(fromList(significant),
      number.angles = 0, point.size = 3, line.size = 1,
      sets.x.label = "Number of significant \n metabolites",
      mainbar.y.label = paste("Upset Diagram of file", i,
                              "\nwith the significant metabolites \naccording to the p-value used
                              \n\nIntersection Size"), 
      set_size.show = T, matrix.color = random_color,
      set_size.scale_max = max(sapply(significant, length))+50,
      text.scale = c(1.5, 1.5, 1, 1, 1.4, 2), order.by = "freq") 
  print(plot)
}

Mean-Average (MA) plot

# Set the p_value adjustment method selected to use in the following steps. You can choose between: p_value_BH, p_value_bonferroni, q_value and P.Value
selected_p_adjustment <- "p_value_BH"

## Create a function that bind columns and complete the missing rows with NA values
cbind.fill <- function(...){
  nm <- list(...) 
  nm<-lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow)) 
  do.call(cbind, lapply(nm, function (x) {
    rbind(x, matrix(NA, n-nrow(x), ncol(x)))
  }))
}

for (i in files2) {
  # Prepare datasets
  data <- get(i)
  names(data) <- gsub("logFC", "log2FoldChange", gsub("AveExpr", "baseMean", gsub(selected_p_adjustment, "padj", names(data))))
  data <- data[c("log2FoldChange", "baseMean", "padj")]
  
  # Save significant metabolites as .csv file, filling with NA values till the end of the dataframe.
  ## Select the desired metabolites to save
  increased <- subset(data, log2FoldChange > stats_values$LFC_limit  & padj < stats_values$alpha_value) %>%
    rownames_to_column() %>% .[, 1] %>% as.data.frame()
  decreased <- subset(data, log2FoldChange < - stats_values$LFC_limit  & padj < stats_values$alpha_value) %>%
    rownames_to_column() %>% .[, 1] %>% as.data.frame()
  ## Save data
  significant_data <- cbind.fill(increased, decreased) %>% `colnames<-`(lm_contrast[sub("_.*.?", "", i)][[1]])
  write.csv(significant_data, file = paste0(input_folder, "/", paste0(sub("_.*.?", "", i),
                                                             "_significant_mets.csv")), row.names = FALSE)
  
  # Performance MA plot
  plot <- ggmaplot(
    data, fdr = stats_values$alpha_value, fc = stats_values$LFC_limit,
    genenames = row.names(data), detection_call = NULL, size = 1.5,
    alpha = 1, seed = 42, font.label = c(10, "italic", "black"),
    label.rectangle = F, palette = c("#B31B21", "#1465AC", "darkgray"), 
    top = 4, select.top.method = c("padj", "fc"), label.select = NULL,
    main = NULL, xlab = "Log2 mean expression", ylab = "Log2 fold change",
    ggtheme = theme_pubr()) + 
    ggtitle(paste0("MA plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
                  " Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
  
  ggsave(filename = paste0(image_folder, "/MA_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
  
  print(plot)
}

Volcano plot of significant metabolites

j = 0
plots <- list()
tables1 <- list()
RDS_files <- list()

for (i in files2){
  data <- get(i) %>% rownames_to_column("Metabolites") %>% .[c("Metabolites", "logFC", selected_p_adjustment)]
  data <- data %>% 
    mutate(Significant = ifelse(abs(logFC) < stats_values$LFC_limit & p_value_BH > stats_values$alpha_value, "ns",
                         ifelse(logFC >= stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "up",
                                ifelse(logFC <= -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "down",
                                       "ns"))))
  
  # Make a table with the 10 most significant metabolites to each condition
  mets <- bind_cols(data %>% arrange(desc(Significant)) %>% head(10) %>% .$Metabolites, 
                    data %>% arrange(Significant) %>% head(10) %>% .$Metabolites) %>% 
    `colnames<-`(c(lm_contrast[sub("_.*.?", "", i)][[1]][1], lm_contrast[sub("_.*.?", "", i)][[1]][2]))
  ## Save data set in a list with the correspondence name
  RDS_files[[sub("_.*.?", "", i)]] <- mets
  
  j <- j + 1
  
  table <- mets %>% 
    kbl(caption = paste("Table", j,": The 10 most significant metabolites to each condition of",
                        sub("_.*.?", "", i), "are: ")) %>% 
    kable_classic(full_width = F, html_font = "Cambria")
  
  index <- length(tables1) + 1
  tables1[[index]] <- table
  
  # Make a dataframe with the most significant metabolites, in order to fix them in the volcano plot
  mets <- list()
  mets <- bind_rows(data %>% arrange(desc(Significant)) %>% head(5), 
                    data %>% arrange(Significant) %>% head(5))
  
  plot <- data %>% 
    ggplot(aes(x=logFC, y=-log10(p_value_BH), color=Significant,
              text=paste0("</br>Metabolite: ", Metabolites,
                          "</br>Adj p-value: ", p_value_BH,
                         "</br>LFC: ", logFC))) +
    geom_point(alpha=0.75, shape=16) + xlim(-15,15) +
    theme_gray() + theme(legend.position = "bottom") + 
    ggtitle(paste0("Volcano plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
                  " Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
    theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold")) +
    labs(color = "Significant \nMetabolite") + 
    scale_color_manual(values = c("up" = "#faaa11", "down" = "#26b3ff", "ns" = "grey")) +
    geom_point(data = mets, aes(color = Significant), size = 4) +
    geom_text_repel(data = mets, aes(label = Metabolites), size = 3, color = "black")

  print(plot)
  ggsave(filename = paste0(image_folder, "/Volcano_plot_of_", sub("_.*.?", "", i), ".png"),
         plot = plot, width = 8, height = 6, dpi = 300)
  
  index <- length(plots) + 1
  plots[[index]] <- plot
}
## New names:
## New names:
## • `` -> `...1`
## • `` -> `...2`

## New names:
## • `` -> `...1`
## • `` -> `...2`

## New names:
## • `` -> `...1`
## • `` -> `...2`

## New names:
## • `` -> `...1`
## • `` -> `...2`

#if (imputed == "Yes") {
#  saveRDS(RDS_files, file = "../../Tables_TFM/significant_mets.rds")
#} else {
#  saveRDS(RDS_files, file = "../../Tables_TFM/significant_mets_common.rds")
#}

Dynamic Volcano plot of significant metabolites

Tables with the most significant metabolites for each comparative

tables1[[1]]
Table 1 : The 10 most significant metabolites to each condition of compA are:
cA_S17 cA_S25
1-(4-Hydroxyphenyl)-2-aminoethanol Pseudouridine
Asenjonamide C Uracil;2_4-Dihydroxypyrimidine
[FAhydroxy(18:0)]12_13-dihydroxy-9Z-octadecenoicacid L-Phenylalanine
Benzylamine Cytisine
Geranylacetate Peniciginseng B
Photopyrone G 6-heptyl-2H-pyran-2-one
10_16-Dihydroxyhexadecanoicacid ?-(Methylenecyclopropyl)glycine
N-Acetyl-beta-D-galactosamine (-)-Jasmonicacid
Palmitoleicacid N-(6-Aminohexanoyl)-6-aminohexanoate
(-)-alpha-Cedrene N-Acetyl-beta-D-galactosamine
tables1[[2]]
Table 2 : The 10 most significant metabolites to each condition of compB are:
cB_OTA_S17 cB_noOTA_S25
2-Oxovalericacid thiodiglycol
Malyngolide allopurinol
Hexadeca-5,9-dienoic acid N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide
Sclerotigenin Uracil;2_4-Dihydroxypyrimidine
Sulcatone Xanthine
2’-Deoxyinosine Pseudouridine
(S,S)-Anacine 5-hydroxymethylfurfural
2-{4-(Cyclohexylamino)-1-methyl-1H-pyrazolo[3,4-d]pyrimidin-6-ylamino}ethanol 2_2’-Iminodipropanoate
Aflatoxin b2 Norepinephrine(Noradrenaline)
C12H29N4P C12H29N4P
tables1[[3]]
Table 3 : The 10 most significant metabolites to each condition of compC are:
cC_OTA cC_noOTA
(S,S)-Anacine C17H35P3
Tanikolide Pyrenochaetic acid F
1,4-Anhydro-5-O-tetradecanoyl-D-glucitol 3-{2-[2-(3-Azidopropoxy)ethoxy]ethoxy}-1-propanamine
ProstaglandinF1a Inosine
4,10-dihydroxy-10-methyl-dodecan-4-olide allopurinol
Takinolide seco-acid C15H33P3
N-{2-Methyl-5-[(3-methylbutanoyl)amino]phenyl}-9-oxooctahydro-2H-pyrazino[1,2-a]pyrazine-2-carboxamide PALGLY
ethyl 9,9-dimethoxynonanoate C4H9N8O5P
Neolymphostinol C Methyl N6-acetyl-L-lysinate
2’-Deoxyinosine Appenolide C
tables1[[4]]
Table 4 : The 10 most significant metabolites to each condition of compD are:
cD_OTA cD_noOTA
C6H16N6O4P2 SPF-32629A
6-Methoxy-8<6’-hydroxy-2’-methoxy-4’(3’‘-methyl-2’’-butenyl)phenoxy>2,2-dimethylchromene 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine
N5-Ethyl-L-glutamine 6-formylsalicylic acid
Acetyl-L-carnitine C17H37P3
3-Methylbut-2-enal C31H42O3P2
1-Pentylamine 1-O-Tetradecanoyl-D-glucitol
6,7-dihydroxy-4,5,6,7-tetrahydroindole-4-one Pentaerythritol laurate
C21H38N5O7P C4H9N8O5P
Asperterrestide A 8-(4-Ethyl-1-piperazinyl)-3-methyl-7-(2-methyl-2-propen-1-yl)-3,7-dihydro-1H-purine-2,6-dione
2-Methyl-2-propanyl 1,4,10-trioxa-7,13-diazacyclopentadecane-7-carboxylate allopurinol
tables1[[5]]
Table 5 : The 10 most significant metabolites to each condition of compE are:
cE_OTA cE_noOTA
C4H9N5O6 1-(4-Hydroxyphenyl)-2-aminoethanol
(R)-4-Dehydropantoate Asenjonamide C
C30H52NP3 1(2H)-Isoquinolinone
3-Dehydrocarnitine Maristachone B
1_6_6-Trimethyl-2_7-dioxabicyclo[3.2.2]nonan-3-one Capsidiol
Tanikolide N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide
1_2_4-Trimethylbenzene Phloionolic acid
2-Oxo-delta3-4_5_5-trimethylcyclopentenylacetate (6RS,10RS)-6,10-dimethylbicyclo[4.4.0]dec-1-en-3-one
[STtrihydrox]3alpha_11beta_21-5alpha-trihydroxy-pregnane-20-one 8-{[2-(Diethylamino)ethyl]amino}-3-methyl-7-(2-methyl-2-propen-1-yl)-3,7-dihydro-1H-purine-2,6-dione
N-(2-Hexyl-4-oxo-3(4H)-quinazolinyl)-4-hydroxy-1-isobutyl-2-oxo-1,2,5,6,7,8-hexahydro-3-quinolinecarboxamide AminoDHQ

Clear environment

——————————————————————————–


Multivariant analysis

sPLS-DA

PLS-DA is a statistical method used for classification. It helps to predict or classify samples into different groups based on a set of variables. It is particularly useful for high-dimensional data like metabolomics. By applying PLS-DA, you can identify the important variables that distinguish between different groups, providing valuable insights for further analysis.

In PLS-DA, the method combines the principles of Partial Least Squares (PLS) regression and Discriminant Analysis. It aims to find a linear combination of the predictor variables that maximally explains the variation in the response variable while also maximizing the separation between different groups or categories.

In this case, given the large number of variables available, we decided to perform a sparse PLS-DA analysis. We limited the analysis to 2 components based on the good separation achieved in the PCA assay. Additionally, we reduced the number of variables to 25.

The prediction capacity of the “pls” model will be analysed by using the “predict” function from mixOmics package. The algorithim used to identify the similarities between the model and the new data is the Mahalanobis distance.

tables1 <- list()
tables2 <- list()
RDS_files <- list()
RDS_files1 <- list()

for (i in files) {
  j <- j + 1 
  data <- get(i)
  Y <- as.character(data[1 ,]) %>% .[-1] %>% as.factor(.)
  mets <- (data[, 1]) %>% .[-1] %>% as.data.frame() %>% `colnames<-`(., "Mets_names") %>% 
    mutate(ID = paste0("Metabolite_", 1:(nrow(data)-1)))
  write.csv(mets, file = paste0(input_folder, "/", paste0(sub("_.*.?", "", i), "_mets_names.csv")), row.names = FALSE)
  data <- data[-1, -1] %>% mutate_all(as.numeric) %>% t() %>% `colnames<-`(., mets$ID)
  
  
  # Prepare data for the sPLS-DA 
  ## Remove variables with zero standard deviation
  sd_values <- apply(data, 2, sd)
  filtered_data <- data[, sd_values > 0] %>% as.data.frame(); X <- filtered_data
  
  # Perform sPLS-DA
  tmp_pls <- mixOmics::splsda(X, Y, ncomp = 2, 
                                 keepX = 25, 
                                 scale = FALSE)

  
  # Prediction capacity
  ## Obtain the mean sd value calculated per sample in the dataset
  mean_sd <- mean(apply(X, 1, sd))
  ## Randomly select 5 observations from the X matrix
  set.seed(123)
  s <- sample(1:nrow(X), 5)
  Xnew = X[s, , drop = FALSE]
  ## Modify original samples
  Xnew = t(apply(Xnew, 1, function(x){x + rnorm(ncol(X), 0, mean_sd/10)}))
  ## Perform de prediction
  myprediction = predict(tmp_pls, Xnew, dist = "mahalanobis.dist")
  ## Save data set in a list with the correspondence name
  RDS_files[[sub("_.*.?", "", i)]] <- myprediction$class
  
  table <- myprediction$class %>% 
    kbl(caption = paste("Table", j,": The predicted classes for the randomly and edited samples of",
                        sub("_.*.?", "", i), "are: ")) %>% 
    kable_classic(full_width = F, html_font = "Cambria")
  
  index <- length(tables1) + 1
  tables1[[index]] <- table
  
  
  # The 10 most important predictor variables selection to each group
  ## Those with positive weight in component 1 -> group_1
  group_1 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(desc(comp1)) %>% 
    head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
  ## Those with negative weight in component 1 -> group_2
  group_2 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(comp1) %>% 
    head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
  
  ## If the sign of the weight for group 1 in the first component is negative, it means that the metabolites associated with group 1 by the sPLS-DA technique will have a negative weight. On the other hand, if the sign of the weight for group 1 is positive, it indicates that the metabolites associated with group 1 by the sPLS-DA technique will have a positive weight.
  if (sign(tmp_pls$loadings$Y[1,1]) < 0){
    group <- tmp_pls$loadings$Y %>% row.names()
  } else {
    group <- tmp_pls$loadings$Y %>% row.names() %>% rev()
  }
  VIP_metabolites <- cbind(group_2, group_1) %>% `colnames<-`(., group)
  
  table <- VIP_metabolites %>% 
    kbl(caption = paste("Table", j+5,": The 10 most important predictor variables of",
                        sub("_.*.?", "", i), "are: ")) %>% 
    kable_classic(full_width = F, html_font = "Cambria")

  index <- length(tables2) + 1
  tables2[[index]] <- table
  
  
  # Save PLS variables loadings in RDS files 
  ## Change ID by their metabolite name
  pls_loadings <- tmp_pls$loadings$X %>% .[, 1] %>% as.data.frame() 
  new_row_names <- rownames(pls_loadings) %>% {mets$Mets_names[match(., mets$ID)]}
  rownames(pls_loadings) <- new_row_names 
  ## Split the data frame in a positive and negative one. Without 0 weight
  positive_weight <- subset(pls_loadings, . > 0) %>% arrange(.) %>% `colnames<-`(., group[2])
  negative_weight <- subset(pls_loadings, . < 0) %>% arrange(.) %>% `colnames<-`(., group[1])
  ## Save both data set in a list with the correspondence name
  list_1 <- list()
  list_1[["positive_weight"]] <- positive_weight; list_1[["negative_weight"]] <- negative_weight
  RDS_files1[[sub("_.*.?", "", i)]] <- list_1
  
  
  # GRAPHS
  ## Loadings
  plotLoadings(tmp_pls, comp = 1, method = 'mean', contrib = 'max',
               title = paste("Loadings of the 25 principal \nmetabolites of", sub("_.*.?", "", i)))

  ## Variables (Metabolites)
  plotVar(tmp_pls, comp = c(1,2), 
        var.names = F, cex = 3, cutoff = 0.8,
        title = paste("Correlation plot of the 25 \nprincipal metabolites of", sub("_.*.?", "", i)))

  ## HeatMap
  #Execute directly in console
  X11()
  cim(tmp_pls, comp = 1,
      xlab = "Metabolites", ylab = "Samples", margins = c(7, 15),
      title = paste("Heatmap of", sub("_.*.?", "", i), 
                    "\nnormalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"), lwid = c(0.1, 0.9)) 
  dev.off()

  ## Individual samples per component
  my_colors <- c("steelblue3", "green3")
  plotIndiv(tmp_pls, comp = 1:2,
                group = Y, col = my_colors, style = "ggplot2", 
                title = paste("PC1 & PC2", i), 
                legend = TRUE, legend.position = "right", 
                legend.title = "Sampling condition", ellipse = TRUE, 
                ellipse.level = 0.95, centroid = FALSE)
}

#if (imputed == "Yes") {
#  saveRDS(RDS_files, file = "../../Supplementary_material/Supplementary_tables/predictions_results.rds")
#  saveRDS(RDS_files1, file = "../../Tables_TFM/loadings_pls_comparatives.rds")
#} else {
#  saveRDS(RDS_files, file = "../../Supplementary_material/Supplementary_tables/predictions_results_common.rds")
#  saveRDS(RDS_files1, file = "../../Tables_TFM/loadings_pls_comparatives_common.rds")
#}

Tables

Tables of predictions results:

Table 6 : The predicted classes for the randomly and edited samples of compA are:
comp1 comp2
area_2_3_15_S25_noOTA cA_S25 cA_S25
area_2_3_15_S17_noOTA cA_S17 cA_S17
area_2_2_15_S25_noOTA cA_S25 cA_S25
area_2_1_15_S17_noOTA cA_S17 cA_S17
area_2_2_15_S17_noOTA cA_S17 cA_S17
Table 7 : The predicted classes for the randomly and edited samples of compB are:
comp1 comp2
area_8_3_15_S25_noOTA cB_noOTA_S25 cB_noOTA_S25
area_8_3_15_S17_OTA cB_OTA_S17 cB_OTA_S17
area_8_2_15_S25_noOTA cB_noOTA_S25 cB_noOTA_S25
area_8_1_15_S17_OTA cB_OTA_S17 cB_OTA_S17
area_8_2_15_S17_OTA cB_OTA_S17 cB_OTA_S17
Table 8 : The predicted classes for the randomly and edited samples of compC are:
comp1 comp2
area_2_3_15_J17_noOTA cC_noOTA cC_noOTA
area_10_3_15_J25_OTA cC_OTA cC_OTA
area_10_1_15_J25_OTA cC_OTA cC_OTA
area_2_2_15_J17_noOTA cC_noOTA cC_noOTA
area_2_3_15_J25_noOTA cC_noOTA cC_noOTA
Table 9 : The predicted classes for the randomly and edited samples of compD are:
comp1 comp2
area_11_3_15_J25_OTA cD_OTA cD_OTA
area_11_2_15_J25_OTA cD_OTA cD_OTA
area_11_3_856_S25_noOTA cD_noOTA cD_noOTA
area_11_1_856_J25_OTA cD_OTA cD_OTA
area_11_2_856_S25_noOTA cD_noOTA cD_noOTA
Table 10 : The predicted classes for the randomly and edited samples of compE are:
comp1 comp2
area_2_3_15_S17_noOTA cE_noOTA cE_noOTA
area_8_3_15_S17_OTA cE_OTA cE_OTA
area_2_2_15_S17_noOTA cE_noOTA cE_noOTA
area_8_1_15_S17_OTA cE_OTA cE_OTA
area_8_2_15_S17_OTA cE_OTA cE_OTA

Tables of most important predictor variables:

Table 11 : The 10 most important predictor variables of compA are:
cA_S17 cA_S25
Asenjonamide C Cytisine
Photopyrone G Uracil;2_4-Dihydroxypyrimidine
1-(4-Hydroxyphenyl)-2-aminoethanol L-Phenylalanine
Geranylacetate Pseudouridine
Benzylamine Peniciginseng B
10_16-Dihydroxyhexadecanoicacid ?-(Methylenecyclopropyl)glycine
[FAhydroxy(18:0)]12_13-dihydroxy-9Z-octadecenoicacid (-)-Jasmonicacid
N-(5-Methyl-1,2-oxazol-3-yl)-4-[(phenylcarbamoyl)amino]benzenesulfonamide N-(6-Aminohexanoyl)-6-aminohexanoate
NP-002322 6-heptyl-2H-pyran-2-one
C12H25N4P NP-015145
Table 12 : The 10 most important predictor variables of compB are:
cB_noOTA_S25 cB_OTA_S17
thiodiglycol 2-Oxovalericacid
N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide Malyngolide
Xanthine Sulcatone
allopurinol 2-{4-(Cyclohexylamino)-1-methyl-1H-pyrazolo[3,4-d]pyrimidin-6-ylamino}ethanol
Uracil;2_4-Dihydroxypyrimidine Aflatoxin b2
Pseudouridine Sclerotigenin
5-hydroxymethylfurfural (S,S)-Anacine
Confluenine E 2’-Deoxyinosine
2_2’-Iminodipropanoate Hexadeca-5,9-dienoic acid
Norepinephrine(Noradrenaline) Isoprene
Table 13 : The 10 most important predictor variables of compC are:
cC_OTA cC_noOTA
(S,S)-Anacine Asenjonamide C
2’-Deoxyinosine C17H35P3
Purine (-)-alpha-Cedrene
Neolymphostinol C (-)-Isopiperitenone
Aurantiomide C (-)-Jasmonicacid
1,4-Anhydro-5-O-tetradecanoyl-D-glucitol (-)-monascumic acid
Methyl 16-hydroxy-12-oxohexadecanoate (-)-trans-Carveol
Aurantiomide B (+)-Bornane-2_5-dione
Sclerotigenin (+)-ethyl homononactate
ProstaglandinF1a (+)-exo-5-Hydroxycamphor
Table 14 : The 10 most important predictor variables of compD are:
cD_OTA cD_noOTA
Acetyl-L-carnitine 1D-chiro-inositol
Ochratoxin A SPF-32629A
1-Pentylamine 6-formylsalicylic acid
MFCD00054545 [FAoxo_amino(6:0)]3-oxo-5S-amino-hexanoicacid
Adenine C17H37P3
Maniwamycin C 8-(4-Ethyl-1-piperazinyl)-3-methyl-7-(2-methyl-2-propen-1-yl)-3,7-dihydro-1H-purine-2,6-dione
2’-Deoxyinosine C31H42O3P2
Purine Pentaerythritol laurate
N5-Ethyl-L-glutamine 2-(Formamido)-N1-(5-phospho-D-ribosyl)acetamidine
Cytisine allopurinol
Table 15 : The 10 most important predictor variables of compE are:
cE_OTA cE_noOTA
Tanikolide Asenjonamide C
C30H52NP3 1-(4-Hydroxyphenyl)-2-aminoethanol
(R)-4-Dehydropantoate 1(2H)-Isoquinolinone
2-Oxo-delta3-4_5_5-trimethylcyclopentenylacetate N,N,N’,N’-Tetrakis(2-methoxyethyl)malonamide
3-Dehydrocarnitine Maristachone B
[STtrihydrox]3alpha_11beta_21-5alpha-trihydroxy-pregnane-20-one Xanthine
Isoprene 1-Methylpyrrolinium
C4H9N5O6 AminoDHQ
1_6_6-Trimethyl-2_7-dioxabicyclo[3.2.2]nonan-3-one Phloionolic acid
C12H14N10 6,7-DIETHYL-1,1,4,4-TETRAMETHYLTETRALINE

HeatMap

HeatMaps from the pls data. To see more information run the code in the console and consulte the graph.

HeatMap comparative A
HeatMap comparative A
HeatMap comparative B
HeatMap comparative B
HeatMap comparative C
HeatMap comparative C
HeatMap comparative D
HeatMap comparative D
HeatMap comparative E
HeatMap comparative E

HeatMap of PLS-DA (all metabolites)

HeatMap comparative A of all metabolites
HeatMap comparative A of all metabolites
HeatMap comparative B of all metabolites
HeatMap comparative B of all metabolites
HeatMap comparative C of all metabolites
HeatMap comparative C of all metabolites

HeatMap comparative D of all metabolites HeatMap comparative E of all metabolites

## ########  ##   ##  ######     ######  ##    ##  ######
##    ##     ##   ##  ##         ##      ###   ##  ##   ##
##    ##     ##   ##  ##         ##      ####  ##  ##    ##
##    ##     #######  ######     ######  ## ## ##  ##     ##
##    ##     ##   ##  ##         ##      ##  ####  ##    ##
##    ##     ##   ##  ##         ##      ##   ###  ##   ##
##    ##     ##   ##  ######     ######  ##    ##  ######
## Script completed successfully!